
Yelp is a business directory service and crowd-sourced review forum, and a public company of the same name headquartered in San Francisco, California. Users post reviews of businesses on Yelp, which helps the community get a better idea of how those businesses perform and thus make better decisions.
According to the Dataset License, we can't use the data for any commercial purpose or for any purpose adverse to Yelp, nor can we redistribute the data to others.
The business.json file contains business data including location data, attributes, and categories. The review.json file contains full review text data including the user_id that wrote the review and the business_id the review is written for.
import pandas as pd
import geopandas as gpd
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
import matplotlib.pyplot as plt
import seaborn as sns
from json_to_csv_converter1 import read_and_write_file, get_superset_of_column_names_from_file, get_column_names, \
get_nested_value, get_row
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from nltk.corpus import stopwords
import numpy as np
Since all the data are saved in JSON format, I used the JSON-to-CSV converter provided as an example on the Yelp Dataset Challenge documentation page: https://github.com/Yelp/dataset-examples.
json_path = 'yelp_dataset/business.json'  # avoid shadowing the imported json module
csv_path = 'business.csv'
column_names = get_superset_of_column_names_from_file(json_path)
read_and_write_file(json_path, csv_path, column_names)
business = pd.read_csv("business.csv")
business
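As an alternative to the CSV conversion step, pandas can read newline-delimited JSON directly with `pd.read_json(..., lines=True)`. A minimal sketch on two toy rows (the column names here are assumptions based on the business.json schema described above):

```python
import io

import pandas as pd

# Two toy lines standing in for yelp_dataset/business.json (one JSON object per line).
sample = io.StringIO(
    '{"business_id": "b1", "city": "Toronto", "state": "ON", "stars": 4.0}\n'
    '{"business_id": "b2", "city": "Las Vegas", "state": "NV", "stars": 3.5}\n'
)
# lines=True tells pandas each line is a separate JSON record.
business_alt = pd.read_json(sample, lines=True)
```

On the real file this would be `pd.read_json('yelp_dataset/business.json', lines=True)`, though nested fields such as attributes would still need flattening (e.g. with json_normalize).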
To study the pattern of the cities this dataset encompasses, I first want to get an idea of where the cities in the dataset are located. As mentioned in the introduction of the challenge, the dataset contains businesses and reviews from 10 metropolitan areas. Here I extract all the unique state codes across businesses.
business['state'].unique()
I notice that the 2-letter codes represent US states and Canadian provinces. The 3-letter codes, however, look unfamiliar, so I extract all the businesses whose state attribute is one of the 3-letter codes above and take a closer look at them. There are only 11 businesses from those regions, and by searching their names on Yelp, I conclude that they are in the UK. Since these are only 11 of 192,609 businesses, I decide to remove them from the dataset before studying it further.
uk_states = ['BAS', 'XGM', 'XWY', 'CON', 'DUR', 'XGL', 'DOW']
business[business['state'].isin(uk_states)]
business = business[~business['state'].isin(uk_states)]
Re-examining all the states and provinces in this dataset, I notice that 18 of them have very few businesses. Thus, I decide to exclude them when studying the state and city composition of the dataset. I create a new variable called state_count to record the number of businesses in each state and remove all states with 22 or fewer businesses.
business['state_count'] = business.groupby('state')['state'].transform('count')
state_count = business[['state','state_count']].drop_duplicates(subset=['state'])
state_count
state_count = state_count[state_count['state_count']>22]
state_count
import plotly.graph_objects as go
fig = go.Figure(data=go.Choropleth(
    locations=state_count['state'],              # spatial coordinates
    z=state_count['state_count'].astype(float),  # data to be color-coded
    locationmode='USA-states',  # entries in `locations` are US state codes
    colorscale='Reds',
    colorbar_title="Number of Businesses",
))
fig.update_layout(
    title_text='Businesses Located in the US by State',
    geo_scope='usa',  # limit map scope to the USA
)
fig.show()
From the map above, I am only going to study the number of businesses in each US state. Arizona has the most businesses in the dataset, more than 56,000, and Nevada has the second most, more than 36,000. Since these two states are geographically adjacent, we may consider them one mega metropolitan area. I also notice three more such pairs of states in this dataset: Wisconsin and Illinois, Ohio and Pennsylvania, and North Carolina and South Carolina.
Because the Plotly library lacks political-boundary data for Canadian provinces, I do not plot their business counts on a map. Looking at the state_count dataframe, there are three Canadian provinces: Ontario, Quebec and Alberta. Ontario has around 33,000 businesses, Quebec around 9,000, and Alberta around 8,000. If they were plotted on the map, Ontario would be shaded red, while Quebec and Alberta would be shaded light orange.
Overall, after removing the outliers, five of the states and provinces in this dataset lie around the Great Lakes, two states lie in the southeast corner of the US, and two in the southwest corner.
Now let's look at where the cities in the dataset are located. I create a new column called city_count to record the number of businesses in each city, and a new dataframe city_count that stores only the cities and their business counts. I am only going to study the 20 cities with the most businesses, so I sort the dataframe by city_count and extract the top 20.
business['city_count'] = business.groupby('city')['city'].transform('count')
city_count = business[['city','city_count','state']].drop_duplicates(subset=['city'])
city_count = city_count.sort_values(by=['city_count'], ascending=False)
city_count = city_count.head(20).reset_index(drop=True)
city_count['city_count'] = city_count['city_count'].astype(int)
city_count
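The groupby/transform plus drop_duplicates route above is equivalent, for the counts themselves, to a single value_counts call; a toy sketch:

```python
import pandas as pd

# Toy stand-in for the business dataframe's city column.
toy = pd.DataFrame({'city': ['Toronto', 'Toronto', 'Mesa', 'Toronto', 'Mesa']})
# value_counts returns per-city business counts, already sorted descending.
counts = toy['city'].value_counts()
```

The transform approach is still useful when the count needs to live alongside the other columns of each row, as it does here for the state lookup.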
I decide to plot a bubble map of the top 20 cities to visualize where they are located and how many businesses they have, and to find out whether there is any relationship between the two. First, I use the OpenCageGeocode API to load the latitude and longitude of each city into the city_count dataframe.
from opencage.geocoder import OpenCageGeocode
key = 'YOUR-API-KEY'  # get an API key from https://opencagedata.com
geocoder = OpenCageGeocode(key)
list_lat = []   # create empty lists for coordinates
list_long = []
for index, row in city_count.iterrows():  # iterate over rows in dataframe
    City = row['city']
    State = row['state']
    query = str(City) + ',' + str(State)
    results = geocoder.geocode(query)
    lat = results[0]['geometry']['lat']
    long = results[0]['geometry']['lng']
    list_lat.append(lat)
    list_long.append(long)
city_count['latitude'] = list_lat
city_count['longtitude'] = list_long
city_count
For the bubble map, I classify the cities into four categories by their number of businesses: 1000-2000, 2000-5000, 5000-10000 and 10000-30000. I use a different color for each size bracket to make the map easier to read.
def get_index(dataframe, lim):
    # Return the [start, stop) index range of rows whose city_count falls in lim.
    index_list = []
    for index, row in dataframe.iterrows():  # iterate over rows in dataframe
        if row['city_count'] > lim[0] and row['city_count'] < lim[1]:
            index_list.append(index)
    return [min(index_list), max(index_list) + 1]
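Note that get_index relies on city_count being sorted by count, so that each size bracket occupies a contiguous index range. A boolean mask selects the same rows without that assumption (toy data):

```python
import pandas as pd

toy = pd.DataFrame({'city': ['a', 'b', 'c'], 'city_count': [1500, 4000, 12000]})
lim = (2000, 5000)
# Select rows inside the bracket directly, regardless of row order.
df_sub = toy[(toy['city_count'] > lim[0]) & (toy['city_count'] < lim[1])]
```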
limits = [(1000,2000),(2000,5000),(5000,10000),(10000,30000)]
colors = ["royalblue","crimson","lightseagreen","orange"]
fig = go.Figure()
for i in range(len(limits)):
    lim = limits[i]
    curr = get_index(city_count, lim)
    df_sub = city_count[curr[0]:curr[1]]
    fig.add_trace(go.Scattergeo(
        lon = df_sub['longtitude'],
        lat = df_sub['latitude'],
        text = df_sub['city'],
        marker = dict(
            size = df_sub['city_count']/50,
            color = colors[i],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'
        ),
        name = '{0} - {1}'.format(lim[0], lim[1])))
fig.update_layout(
    title_text = 'Top 20 Cities with the Most Businesses<br>(Click legend to toggle traces)',
    showlegend = True,
    geo = dict(
        scope = 'north america',
        landcolor = 'rgb(217, 217, 217)',
    )
)
fig.show()
The top 20 cities cluster similarly to the states and provinces. Las Vegas has the most businesses in the US, around 30,000, and Toronto has the most in Canada, around 19,000.
Many bubbles overlap each other, which points to one limitation of this map: the cities listed in the dataset are not always large cities, and satellite cities are often lumped in with their main city. We need to decide whether to treat them as one city or separately.

First, I apply value_counts to the categories column of the business dataframe to get an overview of the most frequent business categories.
business['categories'].value_counts().head(30)
From the count table above, we notice that some of these popular categories are counted twice because the same set of categories appears in a different order. Thus, it's reasonable to treat them as the same category and use one consistent name for each. Since we are only investigating the most frequent business categories overall and the dataset is large, with 93,386 distinct category strings, I am only going to extract the top 50 categories and rename them.
categories_50 = business['categories'].value_counts().head(50).to_frame().index.tolist()
def get_correct_categories(categories):
    # Keep one canonical name per category; near-duplicates
    # (token-sort similarity >= 95) fold into an existing name.
    correct_categories = []
    for cat in categories:
        if len(correct_categories) == 0:
            correct_categories.append(cat)
        if cat not in correct_categories:
            max_score = 0
            for correct in correct_categories:
                ratio = fuzz.token_sort_ratio(correct, cat)
                if ratio > max_score:
                    max_score = ratio
            if max_score < 95:
                correct_categories.append(cat)
    return correct_categories
correct_categories = get_correct_categories(categories_50)
correct_categories
def fuzz_m(col, cat_list, score_t):
    # Replace col with its best fuzzy match in cat_list when the match
    # score reaches 95; otherwise keep the original value.
    new_name, score = process.extractOne(col, cat_list, scorer=score_t)
    if score < 95:
        return col
    else:
        return new_name
business['categories'] = business['categories'].astype(str)
business['new_cat'] = business['categories'].apply(fuzz_m, cat_list=correct_categories, \
score_t=fuzz.token_sort_ratio)
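fuzz.token_sort_ratio sorts each string's tokens before comparing, which is why two category strings that differ only in word order score as duplicates. A rough stdlib approximation (not the fuzzywuzzy implementation, just an illustration of the idea):

```python
import re
from difflib import SequenceMatcher

def token_sort_ratio_approx(a, b):
    # Lowercase, tokenize, sort the tokens, then compare the re-joined strings.
    norm = lambda s: ' '.join(sorted(re.findall(r'\w+', s.lower())))
    return round(SequenceMatcher(None, norm(a), norm(b)).ratio() * 100)
```

Under this scheme 'Pizza, Restaurants' and 'Restaurants, Pizza' normalize to the same token string and score 100, so they fold into one category, while genuinely different names stay below the 95 cutoff.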
I wrote a function to gather all the unique category names, and another to replace each category with its canonical name, saved as new_cat. By extracting the top 20 categories and recording the count of each, I create a new dataframe called cat_count that holds the category names, their counts, and whether each category is food related.
business['cat_count'] = business.groupby('new_cat')['new_cat'].transform('count')
cat_count = business[['new_cat','cat_count']].drop_duplicates(subset=['new_cat'])
cat_count = cat_count.sort_values(by=['cat_count'], ascending=False)
cat_count = cat_count.head(20).reset_index(drop=True)
cat_count['Food Related'] = ['yes', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'yes', \
'no', 'no', 'yes', 'no', 'yes', 'no', 'yes']
cat_count['Color'] = np.where(cat_count['Food Related'] == 'yes', 'lightseagreen',
np.where(cat_count['Food Related'] == 'no', 'orange', None))
cat_count
color = cat_count['Color'].to_list()
y = cat_count['new_cat'].to_list()
x = cat_count['cat_count'].to_list()
labels = cat_count['Food Related'].to_list()
fig, ax = plt.subplots(figsize=(10, 8))
for j in range(len(cat_count)):
    ax.barh(y[j], x[j], align='center', color=color[j], alpha=0.6, label=labels[j])
ax.invert_yaxis()
plt.xlabel('count', fontsize=14)
plt.ylabel('category', fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
leg = ax.legend(['yes', 'no'], title='Food Related', fontsize=14)
leg.set_title('Food Related', prop={'size': 14})
plt.title('Bar plot of the Top 20 Business Categories', fontsize=20)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set individual bar labels using the patch list
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width() + .1, i.get_y() + .55,
            str(round(i.get_width(), 2)), fontsize=10, color='dimgrey')
plt.show()
food = cat_count.groupby(by=['Food Related'])[['cat_count']].sum()
sizes = food['cat_count'].to_list()
labels = food.index.values.tolist()
fig1, ax1 = plt.subplots(figsize=(5, 5))
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 14})
plt.title('Pie Chart of the Percentage of Food Related', fontsize=15)
plt.show()
As shown in the bar chart, the most popular business category is pizza restaurants. The second most popular is real estate and the third is nail salons. All three have around 2,000 businesses in the dataset. Half of the top 20 business categories are food related.
As shown in the pie chart, the percentages of food-related and non-food-related businesses are almost the same. This is surprising because Yelp is frequently used for food reviews. When you go to the Yelp main page, the background image is a restaurant, and the category icons on the main page are all food related. Thus, I wonder whether this varies region by region, and I will investigate it further.
The original dataset business has a column called attributes.BikeParking. This column indicates whether or not the business has bike parking. By only selecting the rows where attributes.BikeParking is True, I create a new dataframe called bike_parking that only consists of the businesses with bike parking.
bike_parking = business[business['attributes.BikeParking'] == 'True']
top_15_businesses = bike_parking['name'].value_counts().head(15).to_dict()
bike_parking['name'].value_counts().head(15)
bike_parking['new_cat'].value_counts().head(15)
When I apply value_counts by business name and by category to the bike_parking dataframe, I obtain the following results. The business with the most bike-parking locations is Starbucks, and the category with the most is Food, Coffee & Tea. The two results agree, and it's not surprising that coffee shops have the most bike-parking locations: people tend to grab a coffee and a snack when they pass by, and bike parking encourages more of them to stop.
Overall, the top 15 businesses and categories with bike parking are mostly food related: 12 of the 15 businesses and 11 of the 15 categories.
I think it's also interesting to ask, for the top 15 businesses with bike parking, how many of their branches do not offer bike parking?
I create a new dataframe called no_bike_parking that contains the parking information about the top 15 businesses with bike parking. I calculate their bike parking rate and save them in a new column called Bike Parking Rate and plot a bar chart of their bike parking rate.
no_bike_parking = business[business['attributes.BikeParking'] == 'False']
top_15_no_parking = no_bike_parking[no_bike_parking['name'].isin(top_15_businesses.keys())]
top_15_no_parking = top_15_no_parking.groupby(by=['name']).size().to_frame(name='No Bike Parking')
def get_bike_parking(df, d):
    # Look up each business name in the bike-parking count dictionary.
    result = []
    for index, row in df.iterrows():
        for key in d:
            if index == key:
                result.append(d[key])
    return result
top_15_no_parking['Bike Parking'] = get_bike_parking(top_15_no_parking, top_15_businesses)
top_15_no_parking['Total'] = top_15_no_parking['Bike Parking'] + top_15_no_parking['No Bike Parking']
top_15_no_parking['Bike Parking Rate'] = top_15_no_parking['Bike Parking'] / top_15_no_parking['Total']
top_15_no_parking = top_15_no_parking.sort_values(by=['Bike Parking Rate'], ascending=False)
top_15_no_parking
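The nested loop in get_bike_parking is a dictionary lookup per index entry, which pandas can express directly as index.map. A toy sketch with hypothetical counts (the names and numbers here are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical bike-parking counts standing in for top_15_businesses.
with_parking = {'Starbucks': 80, "McDonald's": 40}
no_parking = pd.DataFrame({'No Bike Parking': [15, 25]},
                          index=['Starbucks', "McDonald's"])
# Map each business name in the index to its bike-parking count.
no_parking['Bike Parking'] = no_parking.index.map(with_parking)
no_parking['Bike Parking Rate'] = (no_parking['Bike Parking']
                                   / (no_parking['Bike Parking']
                                      + no_parking['No Bike Parking']))
```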
y = top_15_no_parking.index.to_list()
x = top_15_no_parking['Bike Parking Rate'].to_list()
fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(y, x, align='center', alpha=0.5, height=0.6)
ax.invert_yaxis()
plt.xlabel('Bike Parking rate', fontsize=14)
plt.ylabel('Business', fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
# set individual bar labels using the patch list
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width() + .004, i.get_y() + .4,
            str(round(i.get_width(), 3)), fontsize=10, color='dimgrey')
plt.title('Bar plot of the Bike Parking Rate by Top 15 Businesses with Bike Parking', fontsize=15)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
Overall, the bike parking rates of these 15 businesses are very high; all exceed 60%. Starbucks has the highest rate, at almost 84%. Considering Starbucks's branding, we notice that its stores are mostly located in two kinds of places: inside buildings such as malls and office towers, where it attracts the customers most in need of beverages, and in plazas or at street corners, where outdoor space can open during summer. Both kinds of locations are likely to have bike parking, which may explain why Starbucks has the highest rate.
I also want to investigate whether the bike parking rate varies by state. I create a new dataframe called bike_parking_by_state that contains the parking information for each state. Since a few states have very few businesses, I remove those with 22 or fewer for a more accurate result. I calculate each state's bike parking rate, save it in a new column called Bike Parking rate, and plot a bar chart.
bike_parking_by_state = bike_parking.groupby(by=['state']).size().to_frame(name='Bike Parking')
all_bike_state = business[business['state'].isin(bike_parking['state'])]
bike_parking_by_state['All Businesses'] = all_bike_state.groupby(by=['state']).size().to_list()
bike_parking_by_state = bike_parking_by_state[bike_parking_by_state['All Businesses'] > 22]
bike_parking_by_state['Bike Parking rate'] = bike_parking_by_state['Bike Parking'] / \
bike_parking_by_state['All Businesses']
bike_parking_by_state = bike_parking_by_state.sort_values(by=['Bike Parking rate'], ascending=False)
bike_parking_by_state
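Attaching the second groupby's .size().to_list() above works only because both groupbys yield states in the same sorted order. A single groupby of a boolean column computes the rate without that alignment assumption (toy data):

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['WI', 'WI', 'WI', 'SC', 'SC'],
    'attributes.BikeParking': ['True', 'False', 'True', 'False', 'True'],
})
# The mean of a boolean column per state is exactly the bike parking rate.
rate = (toy['attributes.BikeParking'] == 'True').groupby(toy['state']).mean()
```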
y = bike_parking_by_state.index.to_list()
x = bike_parking_by_state['Bike Parking rate'].to_list()
fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(y, x, align='center', alpha=0.5, height=0.6)
ax.invert_yaxis()
plt.xlabel('Bike Parking rate', fontsize=14)
plt.ylabel('State', fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
# set individual bar labels using the patch list
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width() + .004, i.get_y() + .4,
            str(round(i.get_width(), 3)), fontsize=10, color='dimgrey')
plt.title('Bar plot of the Bike Parking Rate by States', fontsize=20)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
As shown in the graph, Wisconsin has the highest bike parking rate, 44.2%, and South Carolina the lowest, 28.7%. I think this is possibly due to the lack of public transit in Wisconsin. I found an article saying that inadequate funding has prevented many Wisconsin communities from expanding public transit service: https://www.wpr.org/new-report-says-stagnant-funding-public-transit-has-limited-wisconsins-transportation-options. This might explain why the bike parking rate in Wisconsin is so high compared to the other states: the lack of public transit pushes more people to get around by bike.
I notice that the original dataset has a variable called review_count recording the number of reviews for each business. I decide to remove businesses whose review_count falls below the 10th percentile or above the 90th percentile, which makes the pattern clearer. I then plot box plots of review counts by stars to see whether there is any relationship between review counts and ratings.
# Removing outliers
q1 = business['review_count'].quantile(0.1)
q3 = business['review_count'].quantile(0.9)
review_stars = business[~((business['review_count'] < q1) \
|(business['review_count'] > q3))].reset_index(drop=True)
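The percentile trim keeps only rows between the 10th and 90th percentiles of review_count; on a toy series of 1-100 the same mask drops roughly the ten smallest and ten largest values:

```python
import pandas as pd

s = pd.Series(range(1, 101))  # stand-in for the review_count column
q1, q3 = s.quantile(0.1), s.quantile(0.9)
trimmed = s[~((s < q1) | (s > q3))]  # same mask as used on the business dataframe
```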
plt.figure(figsize=(15,10))
sns.boxplot(x = 'stars' , y = 'review_count', data = review_stars)
plt.xlabel('Stars', fontsize=14)
plt.ylabel('Review Count', fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('Box plot of Review Counts by Stars', fontsize=20)
sns.despine()
plt.show()
The median review count increases with rating and is highest at 4 and 4.5 stars. At 5 stars, the median drops to around 5 reviews; I think this is because we removed the businesses with very large numbers of reviews. Overall, review count increases with rating, suggesting a positive relationship between the two. However, we don't know whether the relationship is causal, or in which direction.
According to the Wikipedia article on the Greater Toronto Area (link: https://en.wikipedia.org/wiki/Greater_Toronto_Area), GTA postal codes start with L or M. By this definition, I create a new dataframe called gta by extracting businesses whose postal codes start with L or M.
gta = business[(business['postal_code'].str[0] == 'M') | (business['postal_code'].str[0] == 'L')].copy()
gta
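An equivalent way to write the two-condition postal-code filter is to test the first character against a list, which also handles missing postal codes cleanly (toy data):

```python
import pandas as pd

codes = pd.Series(['M5V 2T6', 'L4B 1A1', 'N2L 3G1', None])
# First character in {'M', 'L'}; a missing code's NaN comes out False.
mask = codes.str[0].isin(['M', 'L'])
```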
Similar to Part 1, we notice that some of the most popular categories are counted twice due to the ordering of the category string. As in Part 1, I extract only the top 50 categories and rename them using the same get_correct_categories function.
categories_50_gta = gta['categories'].value_counts().head(50).to_frame().index.tolist()
correct_categories_gta = get_correct_categories(categories_50_gta)
correct_categories_gta
gta['categories'] = gta['categories'].astype(str)
gta['new_cat'] = gta['categories'].apply(fuzz_m, cat_list=correct_categories_gta, \
score_t=fuzz.token_sort_ratio)
By extracting the top 20 categories and recording the count of each, I create a new dataframe called gta_cat_count that holds the category names, their counts, and whether each category is food related.
gta['cat_count'] = gta.groupby('new_cat')['new_cat'].transform('count')
gta_cat_count = gta[['new_cat','cat_count']].drop_duplicates(subset=['new_cat'])
gta_cat_count = gta_cat_count.sort_values(by=['cat_count'], ascending=False)
gta_cat_count = gta_cat_count.head(20).reset_index(drop=True)
gta_cat_count['Food Related'] = ['yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', \
'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no']
gta_cat_count['Color'] = np.where(gta_cat_count['Food Related'] == 'yes', 'lightseagreen',
np.where(gta_cat_count['Food Related'] == 'no', 'orange', None))
gta_cat_count
color = gta_cat_count['Color'].to_list()
y = gta_cat_count['new_cat'].to_list()
x = gta_cat_count['cat_count'].to_list()
labels = gta_cat_count['Food Related'].to_list()
fig, ax = plt.subplots(figsize=(10, 8))
for j in range(len(gta_cat_count)):
    ax.barh(y[j], x[j], align='center', color=color[j], alpha=0.6, label=labels[j])
ax.invert_yaxis()
plt.xlabel('count', fontsize=14)
plt.ylabel('category', fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=12)
leg = ax.legend (['yes', 'no'], title='Food Related', fontsize=14)
leg.legendHandles[1].set_color('orange')
leg.set_title('Food Related',prop={'size':14})
plt.title('Bar plot of the Top 20 Business Categories in GTA', fontsize=20)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set individual bar labels using the patch list
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width() + .1, i.get_y() + .55,
            str(round(i.get_width(), 2)), fontsize=10, color='dimgrey')
plt.show()
food = gta_cat_count.groupby(by=['Food Related'])[['cat_count']].sum()
sizes = food['cat_count'].to_list()
labels = food.index.values.tolist()
fig1, ax1 = plt.subplots(figsize=(5, 5))
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 14})
plt.title('Pie Chart of the Percentage of Food Related in GTA', fontsize=15)
plt.show()
As shown in the bar chart, the most popular business category is coffee, unlike Part 1, where it is pizza restaurants. The second most popular is Chinese restaurants and the third is nail salons. All three have around 500 businesses in the dataset. Out of the top 20 business categories, 16 are food related, compared with only half in Part 1.
As shown in the pie chart, the percentage of food-related businesses is much higher than that of non-food-related businesses. This is very different from Part 1, where the two were almost equal. This suggests GTA residents are more food-oriented than the dataset average.
We notice that the top three cuisines are Chinese, Japanese and Indian. This is possibly due to the large numbers of Chinese, Japanese and Indian immigrants in the GTA. According to the Wikipedia article on the Demographics of Toronto (link: https://en.wikipedia.org/wiki/Demographics_of_Toronto), the 2016 Census indicates that 51.5% of Toronto's population is composed of visible minorities, and almost half of all immigrants to Canada settle in the Greater Toronto Area annually. 12.6% of the population is South Asian, 11.1% is Chinese, and 0.5% is Japanese. This could explain why the GTA has so many foreign-cuisine restaurants.

First, I apply value_counts by business name to get an overview of the data. I instantly notice two variants of the name Tim Hortons. Thus, I write a function to build the list of canonical names and rename only the top 50 franchises, since we are only studying the top franchises in the city.
gta['name'].value_counts().head(50)
def get_correct_names(names):
    # Keep one canonical spelling per franchise; near-duplicate names
    # (similarity >= 95) fold into an existing entry.
    correct_names = []
    for name in names:
        if len(correct_names) == 0:
            correct_names.append(name)
        if name not in correct_names:
            max_score = 0
            for correct in correct_names:
                ratio = fuzz.ratio(correct, name)
                if ratio > max_score:
                    max_score = ratio
            if max_score < 95:
                correct_names.append(name)
    return correct_names
names_50 = gta['name'].value_counts().head(50).to_frame().index.tolist()
correct_names = get_correct_names(names_50)
gta['name'] = gta['name'].astype(str)
gta['new_name'] = gta['name'].apply(fuzz_m, cat_list=correct_names, score_t=fuzz.ratio)
gta.head()
gta['new_name'].value_counts().head(15)
After correcting the names of the businesses, I extract the top 15 businesses and create a variable called count to record the number of locations of each franchise. I also create a variable called Canadian Brand to record whether each franchise is a Canadian brand.
gta_top_15 = gta.groupby(['new_name']).size().to_frame(name='count').sort_values(by=['count'], ascending=False)\
.head(15)
gta_top_15['Canadian Brand'] = ['yes', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', \
'no', 'no', 'yes']
gta_top_15
canada = gta_top_15.groupby(by=['Canadian Brand'])[['count']].sum()
sizes = canada['count'].to_list()
labels = canada.index.values.tolist()
fig1, ax1 = plt.subplots(figsize=(5, 5))
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, textprops={'fontsize': 14})
plt.title('Pie Chart of the Percentage of Canadian Franchises in GTA', fontsize=15)
plt.show()
As the pie chart shows, 55.9% of these businesses belong to Canadian franchises and 44.1% to non-Canadian franchises. Out of the 15 franchises, 9 are Canadian. This makes sense: in the GTA, a densely populated Canadian region, people tend to support their own national brands. Looking at the categories of these franchises, Americans certainly have their own brands in the same categories; the categories are simple, and most of them are food.
Out of all the non-Canadian brands, Starbucks and McDonald's have the most locations in the GTA. This is not surprising because both are large transnational corporations with businesses all over the world. Geographically, Canada and the US are close to each other, and the two countries share many of the same tastes. This explains why Starbucks and McDonald's are also popular in the GTA.
First, according to Canada Post, postal codes in rural areas of Canada have 0 as the second character. Thus, we can distinguish urban from rural areas by postal code. I will study whether ratings differ between the two areas.
rural = gta[gta['postal_code'].str[1] == '0'].copy()
rural['location'] = 'rural'
urban = gta[gta['postal_code'].str[1] != '0'].copy()
urban['location'] = 'urban'
rural_urban = pd.concat([rural, urban])
plt.figure(figsize=(8,8))
sns.boxplot(x='location', y = 'stars', data = rural_urban)
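The two-frame split above can also be done in one step with np.where, avoiding assignment into slices of gta (toy data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'postal_code': ['M5V 2T6', 'L0A 1K0', 'M4C 1B5'],
                    'stars': [4.0, 3.5, 4.5]})
# Second character '0' marks a rural postal code per Canada Post.
toy['location'] = np.where(toy['postal_code'].str[1] == '0', 'rural', 'urban')
```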
Surprisingly, the distribution of ratings in rural and urban areas is almost exactly the same, except that the urban area has more outliers at 1 star. Both have a median of 3.5 stars, a Q1 of 1.5 stars, and a Q3 of 4.0 stars.
This result may not be very accurate. First, the definition of rural and urban areas is ambiguous and not up to date; some locations marked as rural have, after rapid development, large populations and extensive construction today. Also, the ratings are discrete values, and box plots of discrete values can be misleading.
First, I create three dataframes: one containing both coffee chains, one with only Starbucks, and one with only Tim Hortons.
tim_starbucks = gta[(gta['new_name'] == 'Tim Hortons') | (gta['new_name'] == 'Starbucks')]
tim_starbucks
starbucks = gta[gta['new_name'] == 'Starbucks'].copy()
starbucks
tims = gta[gta['new_name'] == 'Tim Hortons'].copy()
tims
To find the shortest distance between each Starbucks and the nearest Tim Hortons (and vice versa), I write a function that returns the shortest distance between a geolocation and all the locations in a dataframe. I then compute these shortest distances and record them in the dataframe new_tims_starbucks.
from geopy import distance
def get_shortest_distance(lat, long, df):
    # Geodesic distance (km) from (lat, long) to every row of df; return the minimum.
    distances = []
    for index, row in df.iterrows():
        distances.append(distance.distance((lat, long),
                                           (row['latitude'], row['longitude'])).km)
    return min(distances)
distances = []
for index, row in starbucks.iterrows():
    distances.append(get_shortest_distance(row['latitude'], row['longitude'], tims))
starbucks['shortest distance'] = distances
distances1 = []
for index, row in tims.iterrows():
    distances1.append(get_shortest_distance(row['latitude'], row['longitude'], starbucks))
tims['shortest distance'] = distances1
new_tims_starbucks = pd.concat([tims, starbucks])
new_tims_starbucks = new_tims_starbucks.reset_index(drop=True)
new_tims_starbucks
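get_shortest_distance makes one geodesic call per pair of stores, which is slow for larger frames. A vectorized haversine (great-circle) sketch in NumPy computes all distances from one point at once; it only approximates geopy's geodesic distances, but to within a fraction of a percent at city scales:

```python
import numpy as np

def haversine_km(lat1, lon1, lats, lons):
    # Great-circle distance (km) from one point to arrays of points.
    R = 6371.0  # mean Earth radius in km
    p1, p2 = np.radians(lat1), np.radians(np.asarray(lats))
    dphi = np.radians(np.asarray(lats) - lat1)
    dlmb = np.radians(np.asarray(lons) - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# Downtown Toronto to itself and to a point a few km to the north-west.
d = haversine_km(43.65, -79.38, [43.65, 43.70], [-79.38, -79.40])
```

The shortest distance to a whole dataframe is then `d.min()`, replacing the inner iterrows loop.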
I now classify the stores into four groups by their shortest distance to the nearest store of the other chain: 0-0.5 km, 0.5-1 km, 1-2 km, and more than 2 km.
Before concluding whether it's true that for every Tim Hortons in the GTA there is a Starbucks nearby, I need to clarify the term "nearby". The dictionary defines "nearby" as "close at hand; not far away". The average human walking speed at crosswalks is about 5.0 km/h, so in this context I consider two stores "nearby" when the distance between them is less than 1 km, about 12 minutes of walking.
new_tims_starbucks.loc[(new_tims_starbucks['shortest distance'] >= 0) & (new_tims_starbucks['shortest distance'] \
< 0.5), 'distance group'] = '0-0.5 km'
new_tims_starbucks.loc[(new_tims_starbucks['shortest distance'] >= 0.5) & (new_tims_starbucks['shortest distance'] \
< 1), 'distance group'] = '0.5-1 km'
new_tims_starbucks.loc[(new_tims_starbucks['shortest distance'] >= 1) & (new_tims_starbucks['shortest distance'] \
< 2), 'distance group'] = '1-2 km'
new_tims_starbucks.loc[(new_tims_starbucks['shortest distance'] >= 2), 'distance group'] = 'more than 2 km'
sums = new_tims_starbucks.groupby(['distance group']).size()
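The four chained .loc assignments above can be written more compactly with pd.cut; a minimal sketch on dummy distances (right=False makes each bin left-closed, matching the >= lower / < upper logic):

```python
import pandas as pd

bins = [0, 0.5, 1, 2, float('inf')]
labels = ['0-0.5 km', '0.5-1 km', '1-2 km', 'more than 2 km']
distances = pd.Series([0.1, 0.7, 1.5, 3.2])
# each bin is [left, right), so 0.5 falls into '0.5-1 km', not '0-0.5 km'
groups = pd.cut(distances, bins=bins, labels=labels, right=False)
# groups: ['0-0.5 km', '0.5-1 km', '1-2 km', 'more than 2 km']
```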
# Pie chart
labels = ['0-0.5 km', '0.5-1 km', '1-2 km', 'more than 2 km']
#colors
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
fig1, ax1 = plt.subplots()
patches, texts, autotexts = ax1.pie(sums, colors = colors, labels=labels, autopct='%1.1f%%', startangle=90)
for text in texts:
text.set_color('black')
for autotext in autotexts:
autotext.set_color('black')
# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.show()
As shown in the pie chart, 48.9% of Starbucks and Tim Hortons stores are within 0.5 km of each other, and another 16.3% are between 0.5 km and 1 km apart. Thus, in this dataset, 65.2% of them are considered nearby each other. Only 15.4% are more than 2 km apart, meaning people would need some other mode of transportation to get from one to the other. Since the "nearby" percentage is greater than 50%, we can conclude that for most, but not all, Tim Hortons in the GTA there is a Starbucks nearby.
Now, I will plot all the Starbucks and Tim Hortons on the map. I get the image of the map from OpenStreetMap, https://www.openstreetmap.org/export#map=10/43.7557/-79.4051.
BBox = ((tim_starbucks.longitude.min(), tim_starbucks.longitude.max(),
tim_starbucks.latitude.min(), tim_starbucks.latitude.max()))
BBox
gta_map = plt.imread('map.png')
fig, ax = plt.subplots(figsize = (30,20))
ax.scatter(starbucks.longitude, starbucks.latitude, zorder=2, alpha=1, c='b', s=20)
ax.scatter(tims.longitude, tims.latitude, zorder=2, alpha= 1, c='r', s=20)
ax.set_title('Plotting Starbucks and Tim Hortons on Map', fontsize=30)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.legend(['Starbucks', 'Tim Hortons'], fontsize=20)
plt.xlabel('longitude', fontsize=20)
plt.ylabel('latitude', fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
ax.imshow(gta_map, zorder=1, extent = BBox, aspect= 'equal')
As shown in the map, most of the Starbucks and Tim Hortons locations are in downtown Toronto, especially in the Financial District, where coffee is almost a daily routine for the people working there. Even though both chains are spread across the GTA, the map suggests that the larger the city, the more Starbucks and Tim Hortons there are.
Toronto's Central Business District is roughly bounded by Yonge St., Front St., University Ave. and Queen St. The headquarters of most major Canadian corporations are located there.
toronto_tim_starbucks = tim_starbucks[tim_starbucks['city'] == 'Toronto']
toronto_starbucks = starbucks[starbucks['city'] == 'Toronto']
toronto_tims = tims[tims['city'] == 'Toronto']
bound = (-79.3868, -79.3769, 43.6455, 43.6536)
cbd_map = plt.imread('cbd_map.png')
fig, ax = plt.subplots(figsize = (10,10))
ax.scatter(toronto_starbucks.longitude, toronto_starbucks.latitude, zorder=2, alpha=1, c='b', s=20)
ax.scatter(toronto_tims.longitude, toronto_tims.latitude, zorder=2, alpha= 1, c='r', s=20)
ax.set_title('Plotting Starbucks and Tim Hortons on Map', fontsize=20)
ax.set_xlim(bound[0],bound[1])
ax.set_ylim(bound[2],bound[3])
ax.legend(['Starbucks', 'Tim Hortons'], fontsize=15)
plt.xlabel('longitude', fontsize=15)
plt.ylabel('latitude', fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
ax.imshow(cbd_map, zorder=1, extent = bound, aspect= 'equal')
json_file = 'yelp_dataset/review.json'  # avoid shadowing the imported json module
csv_file = 'review.csv'
column_names = get_superset_of_column_names_from_file('yelp_dataset/review.json')
read_and_write_file(json_file, csv_file, column_names)
review = pd.read_csv("review.csv")
review.head()
I extract all the reviews in GTA, and count the number of reviews of each user, saved as count in gta_review.
gta_review = review[review['business_id'].isin(gta['business_id'])].copy()  # copy before adding columns
gta_review['count'] = gta_review.groupby('user_id')['user_id'].transform('count')
review_count = gta_review[['user_id', 'count']].drop_duplicates(subset=['user_id'])
review_count.describe()
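The groupby/transform('count') pattern used above broadcasts each user's total review count back onto every one of that user's rows; a small sketch with made-up user IDs:

```python
import pandas as pd

df = pd.DataFrame({'user_id': ['a', 'a', 'b', 'a', 'c']})
# every row of user 'a' receives 3; 'b' and 'c' each receive 1
df['count'] = df.groupby('user_id')['user_id'].transform('count')
# df['count'] -> [3, 3, 1, 3, 1]
```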
Overall, most users post only 1 review. The maximum review count is 148,297, which must be an outlier or a fake reviewer. On average, each user posts around 5 reviews. More than 50% of the reviewers in the official dataset provide only a single review, and 75% of users post 3 or fewer reviews.
This shows that Yelp users are not very engaged in reviewing businesses. Instead, Yelp is used more as a tool to gather information about businesses. As in the in-class survey, where only 1 out of 25 students had posted anything on Yelp, the result of the summary table above is not so surprising.
review_count = review_count.sort_values(by=['count'], ascending=False)
percentile_90 = review_count['count'].quantile(0.90)
review_count[review_count['count'] > percentile_90]
percentage_90 = review_count[review_count['count'] > percentile_90]['count'].sum()/len(gta_review)
percentage_90
percentile_95 = review_count['count'].quantile(0.95)
review_count[review_count['count'] > percentile_95]
percentage_95 = review_count[review_count['count'] > percentile_95]['count'].sum()/len(gta_review)
percentage_95
percentile_99 = review_count['count'].quantile(0.99)
review_count[review_count['count'] > percentile_99]
percentage_99 = review_count[review_count['count'] > percentile_99]['count'].sum()/len(gta_review)
percentage_99
Here, I calculate the share of total reviews contributed by the top users. The top 10% of users by review count are responsible for around 63% of all reviews, the top 5% for around 52%, and the top 1% for around 30%.
This trend is not so surprising because fake reviewers are common on Yelp, used to inflate business ratings and thus attract more customers. I will discuss this further later.
I first create dataframes to save reviews for Starbucks and Tim Hortons in GTA.
gta_starbucks_review = gta_review[gta_review['business_id'].isin(starbucks['business_id'])].copy()
gta_starbucks_review
gta_tims_review = gta_review[gta_review['business_id'].isin(tims['business_id'])].copy()
gta_tims_review
I extracted all reviews of Starbucks and Tim Hortons in GTA. I also removed all the stop words from the texts of the reviews to increase efficiency when processing the data. I then calculated the sentiment scores and saved them in new columns.
stop_words = set(stopwords.words('english'))  # set membership test is much faster than a list
gta_starbucks_review['text_without_stopwords'] = gta_starbucks_review['text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words))
gta_tims_review['text_without_stopwords'] = gta_tims_review['text'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words))
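The filtering above is a plain membership test on whitespace tokens, so capitalization and attached punctuation are left untouched; a minimal sketch with a tiny hand-rolled stop list (not NLTK's full English list):

```python
# tiny illustrative stop list; the notebook uses stopwords.words('english')
stop_words = {'the', 'is', 'a', 'at'}
text = 'the coffee at this location is always fresh'
filtered = ' '.join(w for w in text.split() if w not in stop_words)
# filtered == 'coffee this location always fresh'
# note: a capitalized 'The' or 'the,' with punctuation would survive this filter
```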
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
sia = SIA()
gta_tims_review['Positivity'] = gta_tims_review['text_without_stopwords'].apply\
(lambda Text: pd.Series(sia.polarity_scores(Text)['pos']))
gta_tims_review['Negativity'] = gta_tims_review['text_without_stopwords'].apply\
(lambda Text: pd.Series(sia.polarity_scores(Text)['neg']))
gta_tims_review['Neutrality'] = gta_tims_review['text_without_stopwords'].apply\
(lambda Text: pd.Series(sia.polarity_scores(Text)['neu']))
gta_starbucks_review['Positivity'] = gta_starbucks_review['text_without_stopwords'].apply\
(lambda Text: pd.Series(sia.polarity_scores(Text)['pos']))
gta_starbucks_review['Negativity'] = gta_starbucks_review['text_without_stopwords'].apply\
(lambda Text: pd.Series(sia.polarity_scores(Text)['neg']))
gta_starbucks_review['Neutrality'] = gta_starbucks_review['text_without_stopwords'].apply\
(lambda Text: pd.Series(sia.polarity_scores(Text)['neu']))
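Each of the six apply calls above invokes polarity_scores once per row, so every review is effectively scored three times per chain. A sketch of a helper that scores each text once and spreads the resulting dict into columns (the helper name add_sentiment_columns is mine; in the notebook, scorer would be sia.polarity_scores):

```python
import pandas as pd

def add_sentiment_columns(df, text_col, scorer):
    """Score each text once with a dict-returning scorer (keys 'pos', 'neg', 'neu')
    and expand the results into Positivity/Negativity/Neutrality columns."""
    scores = df[text_col].apply(lambda t: pd.Series(scorer(t)))
    df['Positivity'] = scores['pos']
    df['Negativity'] = scores['neg']
    df['Neutrality'] = scores['neu']
    return df

# usage: add_sentiment_columns(gta_tims_review, 'text_without_stopwords', sia.polarity_scores)
```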
f, axes = plt.subplots(2, 3, figsize=(14,10))
sns.boxplot(x = 'stars' , y = 'Positivity', data = gta_starbucks_review, ax=axes[0, 0])
sns.boxplot(x = 'stars' , y = 'Negativity', data = gta_starbucks_review, ax=axes[0, 1])
sns.boxplot(x = 'stars' , y = 'Neutrality', data = gta_starbucks_review, ax=axes[0, 2])
sns.boxplot(x = 'stars' , y = 'Positivity', data = gta_tims_review, ax=axes[1, 0])
sns.boxplot(x = 'stars' , y = 'Negativity', data = gta_tims_review, ax=axes[1, 1])
sns.boxplot(x = 'stars' , y = 'Neutrality', data = gta_tims_review, ax=axes[1, 2])
axes[0,0].set_title('Starbucks Positivity', fontsize=12)
axes[0,1].set_title('Starbucks Negativity', fontsize=12)
axes[0,2].set_title('Starbucks Neutrality', fontsize=12)
axes[1,0].set_title('Tim Hortons Positivity', fontsize=12)
axes[1,1].set_title('Tim Hortons Negativity', fontsize=12)
axes[1,2].set_title('Tim Hortons Neutrality', fontsize=12)
plt.suptitle('Box Plots of Sentiment Scores of Starbucks and Tim Hortons Reviews', fontsize=20)
sns.despine()
f.savefig('sentiments.png', dpi=200)
As shown in the graph, the median positive score increases with the rating for both Starbucks and Tim Hortons, while the median negative score decreases as the rating increases. This makes sense: satisfied customers tend to use more positive language to describe their experiences, whereas unsatisfied customers tend to use more negative words to complain about the service. The variance in positive scores is highest at 5 stars and generally increases with rating. This may be because people who praise a business are more likely to use varied language and write longer reviews, while dissatisfied reviews tend to be short and contain few positive words. A similar trend appears in the negative scores for both businesses.
The neutral score behaves quite differently. Overall, its median decreases as the rating increases, and its variance is similar across ratings except at 5 stars. The median is highest between 2 and 3 stars. This makes sense because customers who are not very satisfied tend to list facts in their reviews to support their claims, while satisfied customers tend to use more positive words to express their enthusiasm.
Overall, the distribution of sentiment scores is similar for Starbucks and Tim Hortons reviews, which suggests, though does not prove, that customers use similar language when reviewing both. The sentiment scores do reflect customers' attitudes towards the businesses. However, there are limitations: the dataset is small, with only about 5,000 reviews, and removing the stop words may introduce inaccuracy into the sentiment scores.
First, I will list some characteristics of fake reviews.
Thus, I think it would be helpful to use a classification tree to determine whether a review is fake based on the characteristics above. However, I couldn't figure it out right now, and I will come back to it later.
The results and conclusions for each question are listed above. However, I want to discuss the project as a whole and the limitations of my research.
First, I removed many rows to facilitate the data wrangling process. Some of those rows may have contained valid data, so dropping them could introduce inaccuracy into my results.
Also, I only use simple visualizations to present the data, and I believe there are more statistical methods and visualizations I can use to improve the quality of this report as I keep learning.